Universal Codes as a Basis for Nonparametric 
Testing of Serial Independence for Time Series 



Boris Ryabko* and Jaakko Astola†
*Institute of Computational Technology of Siberian Branch of Russian Academy of Science,
†Tampere University of Technology, Finland,
Email: boris@ryabko.net, jaakko.astola@tut.fi 



Abstract: We consider a stationary and ergodic source $p$ that
generates symbols $x_1 \ldots x_t$ from some finite set $A$, and a null
hypothesis $H_0$ that $p$ is a Markov source with memory (or
connectivity) not larger than $m$ ($m \ge 0$). The alternative
hypothesis $H_1$ is that the sequence is generated by a stationary
and ergodic source which differs from the source under $H_0$.
In particular, if $m = 0$ we have the null hypothesis $H_0$ that the
sequence is generated by a Bernoulli source (that is, the hypothesis
that $x_1 \ldots x_t$ are independent). Some new tests, which are based
on universal codes and universal predictors, are suggested.

I. Introduction 

Nonparametric testing for independence of time series is 
very important in statistical applications. There is an extensive 
literature dealing with nonparametric independence testing; a 
quite full review can be found in [10]. 

In this paper, we consider a source (or process) which
generates elements from a finite set $A$, and the following two
hypotheses: $H_0$ is that the source is Markovian with
memory (or connectivity) not larger than $m$ ($m \ge 0$), and
the alternative hypothesis $H_1$ is that the sequence is generated
by a stationary and ergodic source which differs from the
source under $H_0$. The testing should be based on a sample
$x_1 \ldots x_t$ generated by the source.

For example, the sequence $x_1 \ldots x_t$ might be a DNA string,
and one can ask about the depth of the statistical dependence.

We suggest a family of tests that are based on so-called
universal predictors (or universal data compression methods).
The Type I error of the suggested tests is not larger than a
given $\alpha$ ($\alpha \in (0, 1)$) for any source under $H_0$, whereas
the Type II error for any source under $H_1$ tends to 0 as the
sample size $t$ grows.

The suggested tests are based on results and ideas of
Information Theory and, especially, those of universal
coding theory. Informally, the main idea of the tests can be
described as follows. Suppose that the source generates letters
from an alphabet $A$ and one wants to test $H_0$ (the source
is Markovian with memory $m$, $m \ge 0$). First, we recall
that there exist universal codes which, informally speaking,
can "compress" any sequence generated by a stationary and
ergodic source to the length $t h_\infty$ bits, where $h_\infty$ is the
limit Shannon entropy and $t$ tends to infinity. Second, it is
well known in information theory that $h_\infty$ equals the $m$th-order
(conditional) Shannon entropy $h_m$ if $H_0$ is true, and $h_\infty$ is
strictly less than $h_m$ if $H_1$ is true. So the following test looks
natural: compress the sample sequence $x_1 \ldots x_t$ by a
universal code and compare the length of the obtained file
with $t h^*_m$, where $h^*_m$ is an estimate of $h_m$. If the length of
the compressed file is significantly less than $t h^*_m$, then the
hypothesis $H_0$ should be rejected.

It is no surprise that the results and ideas of universal
coding theory can be applied to some classical problems of
mathematical statistics. In fact, methods of universal coding
(and the closely connected universal prediction) are intended to
extract information from observed data in order to compress
(or predict) data efficiently in the case where the source
statistics are unknown. Recently such a connection between universal
coding and mathematical statistics was used in [4] for estimating
the order of Markov sources and for constructing efficient
tests for randomness, i.e. for testing the hypothesis $H_0$ that a
sequence is generated by a Bernoulli source and all letters have
equal probabilities against $H_1$ that the sequence is generated
by a stationary and ergodic source which differs from the
source under $H_0$; see [19].

The outline of the paper is as follows. The next part 
contains definitions and necessary information from the theory 
of universal coding and universal prediction. Part three is 
devoted to the testing of the above described hypotheses. All 
proofs are given in the appendix. 

II. Definitions and Preliminaries. 

Consider an alphabet $A = \{a_1, \ldots, a_n\}$ with $n \ge 2$ letters
and denote by $A^t$ the set of words $x_1 \ldots x_t$ of length $t$
from $A$. Let $\mu$ be a source which generates letters from $A$.
Formally, $\mu$ is a probability distribution on the set of words
of infinite length or, more simply, $\mu = (\mu^t)_{t \ge 1}$ is a consistent
set of probabilities over the sets $A^t$, $t \ge 1$. By $M_\infty(A)$ we
denote the set of all stationary and ergodic sources which
generate letters from $A$. Let $M_k(A) \subset M_\infty(A)$ be the set
of Markov sources with memory (or connectivity) $k$, $k \ge 0$.
More precisely, by definition $\mu \in M_k(A)$ if

$$\mu(x_{t+1} = a_{i_1} \mid x_t = a_{i_2}, x_{t-1} = a_{i_3}, \ldots, x_{t-k+1} = a_{i_{k+1}}, \ldots)$$

$$= \mu(x_{t+1} = a_{i_1} \mid x_t = a_{i_2}, x_{t-1} = a_{i_3}, \ldots, x_{t-k+1} = a_{i_{k+1}})$$

for all $t \ge k$ and $a_{i_1}, a_{i_2}, \ldots \in A$. By definition, $M_0(A)$ is
the set of all Bernoulli (or i.i.d.) sources over $A$.



2.1 Universal prediction. 

Now we briefly describe results and methods of universal
coding and prediction which will be used later.
Let a source generate a message $x_1 \ldots x_{t-1} x_t \ldots$ and
let $\nu^t(a)$ denote the count of letter $a$ occurring in the
word $x_1 \ldots x_{t-1} x_t$. After the first $t$ letters $x_1, \ldots, x_{t-1}, x_t$
have been processed, the following letter $x_{t+1}$ needs to
be predicted. By definition, a prediction is a set of non-negative
numbers $\pi(a_1 \mid x_1 \cdots x_t), \ldots, \pi(a_n \mid x_1 \cdots x_t)$ which
are estimates of the unknown conditional probabilities
$p(a_1 \mid x_1 \cdots x_t), \ldots, p(a_n \mid x_1 \cdots x_t)$, i.e. of the probabilities
$p(x_{t+1} = a_i \mid x_1 \cdots x_t)$, $i = 1, \ldots, n$.

Laplace suggested the following predictor:

$$L(a \mid x_1 \cdots x_t) = (\nu^t(a) + 1)/(t + |A|), \tag{1}$$

where $|A|$ is the number of letters in the alphabet $A$; see [8].
For example, if $A = \{0, 1\}$ and $x_1 \ldots x_5 = 01010$, then the Laplace
prediction is as follows: $L(x_6 = 0 \mid 01010) = (3 + 1)/(5 + 2) = 4/7$,
$L(x_6 = 1 \mid 01010) = (2 + 1)/(5 + 2) = 3/7$.
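
The example above can be reproduced with a minimal sketch of predictor (1) in Python. The function name `laplace_predict` and the use of exact fractions are illustrative choices, not part of the original text.

```python
from collections import Counter
from fractions import Fraction


def laplace_predict(history, alphabet):
    """Return {a: L(a | history)} with L = (count(a) + 1) / (t + |A|)."""
    counts = Counter(history)
    t = len(history)
    return {a: Fraction(counts[a] + 1, t + len(alphabet)) for a in alphabet}


prediction = laplace_predict("01010", "01")
print(prediction["0"], prediction["1"])  # 4/7 3/7
```

Because one is added to every count, no letter ever receives probability zero, which keeps the logarithms used later finite.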

In Information Theory the error of prediction is often estimated
by the Kullback-Leibler (K-L) divergence between
a distribution $p$ and its estimate. Consider a source $p$ and a
predictor $\gamma$. The error is characterized by the divergence

$$\rho_{\gamma,p}(x_1 \cdots x_t) = \sum_{a \in A} p(a \mid x_1 \cdots x_t) \log \frac{p(a \mid x_1 \cdots x_t)}{\gamma(a \mid x_1 \cdots x_t)}. \tag{2}$$

(Here and below $\log = \log_2$.) It is well known that for any
distributions $p$ and $\gamma$ the K-L divergence is nonnegative and
equals 0 if and only if $p(a) = \gamma(a)$ for all $a$ (see, e.g.,
[9]); that is why the K-L divergence is a natural estimate of
the prediction error. For fixed $t$, $\rho_{\gamma,p}$ is a random variable,
because $x_1, x_2, \ldots, x_t$ are random variables. We define the
average error at time $t$ by

$$\rho^t(p \| \gamma) = E(\rho_{\gamma,p}(\cdot)) = \sum_{x_1 \cdots x_t \in A^t} p(x_1 \cdots x_t)\, \rho_{\gamma,p}(x_1 \cdots x_t).$$

It is known that the error of the Laplace predictor goes to 0 for
any Bernoulli source $p$. More precisely, it is proven that

$$\rho^t(p \| L) < (|A| - 1)/(t + 1) \tag{3}$$

for any Bernoulli source $p$; see [18], [20].

Obviously, the convergence to 0 of a predictor's error for
any source from some set $M$ is an important property. For
example, we can see from (3) that it holds for the Laplace
predictor and the set of Bernoulli sources $M_0(A)$. Unfortunately,
it is known that a predictor whose error (2) goes to 0 for any
stationary and ergodic source does not exist. More precisely,
for any predictor $\gamma$ there exists a stationary and ergodic
source $p$ such that $\limsup_{t \to \infty} \rho_{\gamma,p}(x_1 \cdots x_t) \ge \mathrm{const} > 0$ with
probability 1; see [17]. (See also [1], [13], [14], where this result
is generalized and the history of its discovery is described. In
particular, those authors found that such a result was described by
Bailey [2] in his unpublished thesis.) That is why it is difficult
to use the error (2) for comparison of different predictors. On the
other hand, it is shown in [16], [17] that there exists a predictor
$R$ such that the average $t^{-1} \sum_{i=1}^{t} \rho_{R,p}(x_1 \cdots x_i)$ goes to 0
(with probability 1) for any stationary and ergodic source
$p$ as $t$ goes to infinity. That is why we will focus our
attention on such averages. First, we define for any predictor
$\pi$ the following probability distribution:

$$\pi(x_1 \ldots x_t) = \prod_{i=1}^{t} \pi(x_i \mid x_1 \ldots x_{i-1}).$$

For example, we obtain for the Laplace predictor $L$ that
$L(0101) = \frac{1}{2} \cdot \frac{1}{3} \cdot \frac{1}{2} \cdot \frac{2}{5} = \frac{1}{30}$; see (1). Then, by analogy with (2),
we will estimate the error by the K-L divergence and define

$$\rho_{\gamma,p}(x_1 \ldots x_t) = t^{-1} \log\bigl(p(x_1 \ldots x_t)/\gamma(x_1 \ldots x_t)\bigr) \tag{4}$$

and

$$\rho_t(\gamma, p) = t^{-1} \sum_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t) \log\bigl(p(x_1 \ldots x_t)/\gamma(x_1 \ldots x_t)\bigr). \tag{5}$$

For example, from these definitions and (3) we obtain the
following estimate for the Laplace predictor $L$ and any Bernoulli
source $p$: $\rho_t(L, p) < ((|A| - 1) \log t + c)/t$, where $c$ is a constant.
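
The product definition and the per-letter error (4) can be sketched in Python as follows. This is only an illustration for the Laplace predictor and a Bernoulli source; the function names and the `letter_prob` dictionary are assumptions introduced here, not notation from the paper.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import log2


def laplace_word_probability(word, alphabet):
    """L(x_1...x_t) as the product of the one-step Laplace predictions (1)."""
    counts = Counter()
    probability = Fraction(1)
    for t, letter in enumerate(word):
        probability *= Fraction(counts[letter] + 1, t + len(alphabet))
        counts[letter] += 1
    return probability


def per_letter_error(word, letter_prob, alphabet):
    """Equation (4) for a Bernoulli source with letter probabilities letter_prob."""
    source_probability = 1.0
    for letter in word:
        source_probability *= letter_prob[letter]
    predictor_probability = float(laplace_word_probability(word, alphabet))
    return log2(source_probability / predictor_probability) / len(word)


print(laplace_word_probability("0101", "01"))  # 1/30

# The values L(x_1...x_4) over all 16 binary words sum to 1,
# so L is indeed a probability distribution on A^4.
total = sum(laplace_word_probability("".join(w), "01") for w in product("01", repeat=4))
print(total)  # 1
```

Note that the per-letter error (4) can be negative for an individual word, whereas its average (5) is a K-L divergence and is therefore nonnegative.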

The universal predictors will play a key role in the tests
suggested below. By definition, a predictor $\gamma$ is called universal (in
average) for a class of sources $M$ if for any $p \in M$ the error
$\rho_t(\gamma, p)$ goes to 0 as $t$ goes to infinity. A predictor $\gamma$ is
called universal with probability 1 if the error $\rho_{\gamma,p}(x_1 \ldots x_t)$
goes to 0 not only in average, but for almost all sequences
$x_1 x_2 \ldots$. For short, we will say that the predictor (or probability
distribution) $\gamma$ is universal if $\lim_{t \to \infty} \rho_{\gamma,p}(x_1 \ldots x_t) = 0$
holds with probability 1 for any stationary and ergodic
source (i.e. for any $p \in M_\infty(A)$). By now quite a few
universal predictors are known. One of the first such predictors is
described in [16].

2.2 Universal coding.

This short subsection is intended to explain why and how
methods of data compression can be used for testing of
independence. The point is that the prediction problem is deeply
connected with the theory of universal coding. Moreover, data
compression methods used in practice (so-called archivers) can be
directly applied for testing.

Let us give some definitions. Let, as before, $A$ be a finite
alphabet and, by definition, $A^* = \bigcup_{n=1}^{\infty} A^n$, and $A^\infty$ is
the set of all infinite words $x_1 x_2 \ldots$ over the alphabet $A$.
A data compression method (or code) $\varphi$ is defined as a
set of mappings $\varphi_n$ such that $\varphi_n : A^n \to \{0, 1\}^*$, $n = 1, 2, \ldots$,
and for each pair of different words $x, y \in A^n$,
$\varphi_n(x) \ne \varphi_n(y)$. Informally, this means that the code $\varphi$ can
be applied for compression of each message of any length
$n$, $n > 0$, over the alphabet $A$ and the message can be decoded
if its code is known. One more restriction is required in
Information Theory. Namely, it is required that each sequence
$\varphi_n(x_1) \varphi_n(x_2) \ldots \varphi_n(x_r)$, $r \ge 1$, of encoded words from the
set $A^n$, $n \ge 1$, can be uniquely decoded into $x_1 x_2 \ldots x_r$.
Such codes are called uniquely decodable. For example, let
$A = \{a, b\}$; the code $\varphi_1(a) = 0$, $\varphi_1(b) = 00$ is obviously
not uniquely decodable. (Indeed, the word 000 can be decoded
as both $ab$ and $ba$.) It is well known that if a code $\varphi$ is
uniquely decodable then the lengths of the codewords satisfy
the following inequality (the Kraft inequality):

$$\sum_{u \in A^n} 2^{-|\varphi_n(u)|} \le 1;$$

see, e.g., [9]. It will be convenient to reformulate this
property as follows:

Claim 1. Let $\varphi$ be a uniquely decodable code over an
alphabet $A$. Then for any integer $n$ there exists a measure
$\mu_\varphi$ on $A^n$ such that

$$-\log \mu_\varphi(u) \le |\varphi(u)| \tag{6}$$

for any $u$ from $A^n$. (Obviously, this is true for the measure
$\mu_\varphi(u) = 2^{-|\varphi(u)|} / \sum_{w \in A^n} 2^{-|\varphi(w)|}$.)

It is known in Information Theory that sequences $x_1 \ldots x_t$
generated by a stationary and ergodic source $p$ can be
"compressed" to the length $-\log p(x_1 \ldots x_t)$ bits. There exist
so-called universal codes which, in a certain sense, are the best
"compressors" for all stationary and ergodic sources. The formal
definition is as follows: a code $\varphi$ is universal if for any
stationary and ergodic source $p$

$$\lim_{t \to \infty} t^{-1} \bigl(-\log p(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)|\bigr) = 0$$

with probability 1. So, informally speaking, universal
codes estimate the probability characteristics of the source $p$
and use them for efficient "compression".
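
The measure used in Claim 1 can be illustrated with a small Python sketch. The toy prefix code on $A^2$ below and the function names are assumptions made for the illustration; they do not come from the paper.

```python
from math import log2


def kraft_sum(codewords):
    """Sum of 2^(-|phi(u)|) over all words u in the code table."""
    return sum(2.0 ** -len(c) for c in codewords.values())


def induced_measure(codewords):
    """mu_phi(u) = 2^(-|phi(u)|) / (Kraft sum), as in the remark after (6)."""
    total = kraft_sum(codewords)
    return {u: 2.0 ** -len(c) / total for u, c in codewords.items()}


# Hypothetical prefix (hence uniquely decodable) code for words of length 2 over {a, b}.
toy_code = {"aa": "0", "ab": "10", "ba": "110", "bb": "1110"}

print(kraft_sum(toy_code))  # 0.9375
mu = induced_measure(toy_code)
# Inequality (6): -log2 mu(u) never exceeds the codeword length.
print(all(-log2(mu[u]) <= len(c) for u, c in toy_code.items()))  # True
```

Dividing by the Kraft sum only increases each value $2^{-|\varphi(u)|}$, which is why (6) holds whenever the Kraft sum is at most 1.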

III. The Tests. 

In this section we describe the suggested tests. First, we
give some definitions. Let $v$ be a word $v = v_1 \ldots v_k$, $k \le t$, $v_i \in A$.
Denote the number of occurrences of the word $v$ in the sequence of words

$$x_1 x_2 \ldots x_k,\; x_2 x_3 \ldots x_{k+1},\; x_3 x_4 \ldots x_{k+2},\; \ldots,\; x_{t-k+1} \ldots x_t$$

as $\nu^t(v)$. For example, if $x_1 \ldots x_t = 000100$ and $v = 00$,
then $\nu^6(00) = 3$. Now we define for any $k \ge 0$ the so-called
empirical Shannon entropy of order $k$ as follows:

$$h^*_k(x_1 \ldots x_t) = -\frac{1}{t - k} \sum_{v \in A^k} \bar\nu^t(v) \sum_{a \in A} \bigl(\nu^t(va)/\bar\nu^t(v)\bigr) \log\bigl(\nu^t(va)/\bar\nu^t(v)\bigr), \tag{7}$$

where $k < t$ and $\bar\nu^t(v) = \sum_{a \in A} \nu^t(va)$. In particular, if $k = 0$,
we obtain

$$h^*_0(x_1 \ldots x_t) = -\frac{1}{t} \sum_{a \in A} \nu^t(a) \log\bigl(\nu^t(a)/t\bigr).$$
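
A minimal Python sketch of the counts $\nu^t(v)$ and of the empirical entropy (7) is given below. The function names are illustrative assumptions; the sequence is taken to be a Python string and contexts are substrings.

```python
from collections import Counter
from math import log2


def window_counts(x, k):
    """nu^t(v) for every length-k word v among the sliding windows of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))


def empirical_entropy(x, k):
    """Empirical Shannon entropy h*_k of equation (7), in bits per letter."""
    t = len(x)
    if not 0 <= k < t:
        raise ValueError("need 0 <= k < len(x)")
    joint = window_counts(x, k + 1)       # nu^t(va)
    context = Counter()                   # nu-bar^t(v) = sum over a of nu^t(va)
    for word, count in joint.items():
        context[word[:k]] += count
    total = sum(c * log2(c / context[w[:k]]) for w, c in joint.items())
    return -total / (t - k)


print(window_counts("000100", 2)["00"])  # 3
print(empirical_entropy("0101", 0))      # 1.0
```

Only words that actually occur contribute to the sum, so the usual convention $0 \log 0 = 0$ never has to be handled explicitly.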

The suggested test is as follows. 

Let $\sigma$ be any probability distribution over $A^t$. By definition,
the hypothesis $H_0$ is accepted if

$$(t - m)\, h^*_m(x_1 \ldots x_t) - \log(1/\sigma(x_1 \ldots x_t)) \le \log(1/\alpha), \tag{8}$$

where $\alpha \in (0, 1)$. Otherwise, $H_0$ is rejected. We denote this
test by $\Gamma^t_{\alpha,\sigma,m}$.

Theorem. i) For any probability distribution (or predictor)
$\sigma$ the Type I error of the test $\Gamma^t_{\alpha,\sigma,m}$ is less than or equal to
$\alpha$, $\alpha \in (0, 1)$.

ii) If $\sigma$ is a universal predictor (measure) (i.e., by definition,
for any $p \in M_\infty(A)$

$$\lim_{t \to \infty} t^{-1} \bigl(-\log p(x_1 \ldots x_t) - \log(1/\sigma(x_1 \ldots x_t))\bigr) = 0 \tag{9}$$

with probability 1), then the Type II error goes to 0 as $t$
goes to infinity.

The proof is given in the Appendix.

Comment. Let $\varphi$ be a uniquely decodable code (or a data
compression method). Define the test $\hat\Gamma^t_{\alpha,\varphi,m}$ as follows: the
hypothesis $H_0$ is accepted if

$$(t - m)\, h^*_m(x_1 \ldots x_t) - |\varphi(x_1 \ldots x_t)| \le \log(1/\alpha), \tag{10}$$

where $\alpha \in (0, 1)$. Otherwise, $H_0$ is rejected.

We immediately obtain from the Theorem and Claim 1
the following statement.

Claim 2. i) For any uniquely decodable code $\varphi$ the Type I
error of the test $\hat\Gamma^t_{\alpha,\varphi,m}$ is less than or equal to $\alpha$, $\alpha \in (0, 1)$.

ii) If $\varphi$ is a universal code, then the Type II error goes to
0 as $t$ goes to infinity.
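
The test (10) can be sketched in Python with an off-the-shelf compressor. In this sketch `zlib` stands in for the code $\varphi$; that choice is an assumption made for illustration (the paper does not name a particular archiver, and zlib is not claimed here to be a universal code in the formal sense). The function names are likewise hypothetical.

```python
import zlib
from collections import Counter
from math import log2


def empirical_entropy(x, k):
    """Empirical Shannon entropy h*_k of equation (7), in bits per letter."""
    t = len(x)
    if not 0 <= k < t:
        raise ValueError("need 0 <= k < len(x)")
    # nu^t(va): counts of the (k+1)-letter words in the sliding windows of x.
    joint = Counter(x[i:i + k + 1] for i in range(t - k))
    # nu-bar^t(v): sum of nu^t(va) over the last letter a.
    context = Counter()
    for word, count in joint.items():
        context[word[:k]] += count
    total = sum(c * log2(c / context[w[:k]]) for w, c in joint.items())
    return -total / (t - k)


def zlib_code_length_bits(x):
    """Length in bits of the zlib encoding of x (assumed stand-in for |phi(x)|)."""
    return 8 * len(zlib.compress(x.encode("ascii"), 9))


def accept_h0(x, m, alpha, code_length_bits=zlib_code_length_bits):
    """Return True if inequality (10) holds, i.e. H0 (memory <= m) is accepted."""
    statistic = (len(x) - m) * empirical_entropy(x, m) - code_length_bits(x)
    return statistic <= log2(1 / alpha)


sample = "01" * 2000  # deterministic alternation: memory 1, not independent
print(accept_h0(sample, m=0, alpha=0.01))  # False: independence is rejected
print(accept_h0(sample, m=1, alpha=0.01))  # True: memory <= 1 is accepted
```

Passing the code length as a function keeps the sketch independent of the compressor, so a different archiver can be substituted by supplying another `code_length_bits` callable.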

IV. Conclusion. 

The tests described above can be based on known universal
codes (or so-called archivers), which are widely used for text
compression. It is important to note that, on the one
hand, universal codes and archivers are based on results of
Information Theory, the theory of algorithms and some other
branches of mathematics; see, for example, [7],
[11], [12], [15], [21]. On the other hand, archivers have
shown high efficiency in practice as compressors of texts,
DNA sequences and many other types of real data. In fact,
archivers can find many kinds of latent regularities, which is
why they look like a promising tool for independence testing
and its generalizations.

A natural question is whether the suggested tests can be
generalized to the case of an infinite source alphabet $A$ (say,
$A$ is a metric space). Apparently, such a generalization can be
done for the case of independence testing if we use known
methods of partitioning; see [5], [6]. But we do not know how
to generalize the suggested tests to the case where $H_0$ is that
the source is Markovian. The point is that partitioning can
increase the source memory. For example, even if the alphabet
$A$ contains three letters and we combine two of them into one
subset (i.e. a new letter), the memory of the obtained source
can become infinite. Hence, the generalization to Markov
sources with an infinite alphabet can be considered an open
problem.



V. Appendix. 

Proof of Theorem. First we show that for any Bernoulli
source $\tau^*$ and any word $x_1 \ldots x_t \in A^t$, $t > 1$, the following
inequality is valid:

$$\tau^*(x_1 \ldots x_t) = \prod_{a \in A} \tau^*(a)^{\nu^t(a)} \le \prod_{a \in A} (\nu^t(a)/t)^{\nu^t(a)}. \tag{11}$$

Indeed, the equality is true because $\tau^*$ is a Bernoulli measure.
The inequality follows from the well-known inequality
$\sum_{a \in A} p(a) \log(p(a)/q(a)) \ge 0$ for the K-L divergence, which
is true for any distributions $p$ and $q$ (see, e.g., [9]). So, if
$p(a) = \nu^t(a)/t$ and $q(a) = \tau^*(a)$, then

$$\sum_{a \in A} \frac{\nu^t(a)}{t} \log \frac{\nu^t(a)/t}{\tau^*(a)} \ge 0.$$

From the last inequality we obtain (11).
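
Inequality (11) says that no Bernoulli source assigns a word a higher probability than the one built from the word's own letter frequencies. A small numerical check in Python, working with base-2 logarithms of both sides, is sketched below; the function names and the sample word are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def log_bernoulli_probability(word, letter_prob):
    """log2 of tau*(x_1...x_t) for a Bernoulli source with the given letter probabilities."""
    return sum(log2(letter_prob[a]) for a in word)


def log_empirical_bound(word):
    """log2 of the right-hand side of (11): product over a of (nu(a)/t)^nu(a)."""
    t = len(word)
    return sum(c * log2(c / t) for c in Counter(word).values())


word = "0001000110"  # seven zeros and three ones
bound = log_empirical_bound(word)
for p0 in (0.1, 0.3, 0.5, 0.7, 0.9):
    lhs = log_bernoulli_probability(word, {"0": p0, "1": 1 - p0})
    # A small tolerance absorbs floating-point rounding at p0 = 0.7,
    # where the two sides coincide.
    print(p0, lhs <= bound + 1e-9)
```

Equality is attained when the source probabilities equal the empirical frequencies, here $\tau^*(0) = 0.7$.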

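Let now $\tau$ belong to $M_m(A)$, $m > 0$. We will prove that
for any $x_1 \ldots x_t$

$$\tau(x_1 \ldots x_t) \le \prod_{u \in A^m} \prod_{a \in A} \bigl(\nu^t(ua)/\bar\nu^t(u)\bigr)^{\nu^t(ua)}. \tag{12}$$

Indeed, we can present $\tau(x_1 \ldots x_t)$ as

$$\tau(x_1 \ldots x_t) = \tau_\infty(x_1 \ldots x_m) \prod_{u \in A^m} \prod_{a \in A} \tau(a \mid u)^{\nu^t(ua)},$$

where $\tau_\infty(x_1 \ldots x_m)$ is the limit probability of the word
$x_1 \ldots x_m$. From the last equality we can see that

$$\tau(x_1 \ldots x_t) \le \prod_{u \in A^m} \prod_{a \in A} \tau(a \mid u)^{\nu^t(ua)}.$$

Taking into account the inequality (11), we obtain

$$\prod_{a \in A} \tau(a \mid u)^{\nu^t(ua)} \le \prod_{a \in A} \bigl(\nu^t(ua)/\bar\nu^t(u)\bigr)^{\nu^t(ua)}$$

for any word $u$. So, from the last two inequalities we obtain
(12).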
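It will be convenient to define an auxiliary measure on $A^t$
as follows:

$$\pi_m(x_1 \ldots x_t) = \Delta\, 2^{-(t - m) h^*_m(x_1 \ldots x_t)}, \tag{13}$$

where $x_1 \ldots x_t \in A^t$ and

$$\Delta = \Bigl(\sum_{x_1 \ldots x_t \in A^t} 2^{-(t - m) h^*_m(x_1 \ldots x_t)}\Bigr)^{-1}.$$

If we take into account that

$$2^{-(t - m) h^*_m(x_1 \ldots x_t)} = \prod_{u \in A^m} \prod_{a \in A} \bigl(\nu^t(ua)/\bar\nu^t(u)\bigr)^{\nu^t(ua)},$$

we can see from (12) and (13) that, for any measure $\tau \in M_m(A)$
and any $x_1 \ldots x_t \in A^t$,

$$\tau(x_1 \ldots x_t) \le \pi_m(x_1 \ldots x_t)/\Delta. \tag{14}$$

Let us denote the critical set of the test $\Gamma^t_{\alpha,\sigma,m}$ as $C_\alpha$, i.e., by
definition,

$$C_\alpha = \{x_1 \ldots x_t : (t - m)\, h^*_m(x_1 \ldots x_t) - \log(1/\sigma(x_1 \ldots x_t)) > \log(1/\alpha)\}. \tag{15}$$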

From (14) and this definition we can see that for any measure
$\tau \in M_m(A)$

$$\tau(C_\alpha) \le \pi_m(C_\alpha)/\Delta. \tag{16}$$

From the definitions (15) and (13) we obtain

$$C_\alpha = \{x_1 \ldots x_t : 2^{(t - m) h^*_m(x_1 \ldots x_t)} > (\alpha\, \sigma(x_1 \ldots x_t))^{-1}\}$$

$$= \{x_1 \ldots x_t : (\pi_m(x_1 \ldots x_t)/\Delta)^{-1} > (\alpha\, \sigma(x_1 \ldots x_t))^{-1}\}.$$

Finally,

$$C_\alpha = \{x_1 \ldots x_t : \sigma(x_1 \ldots x_t) > \pi_m(x_1 \ldots x_t)/(\alpha \Delta)\}. \tag{17}$$



The following chain of inequalities and equalities is valid:

$$1 \ge \sum_{x_1 \ldots x_t \in C_\alpha} \sigma(x_1 \ldots x_t) \ge \sum_{x_1 \ldots x_t \in C_\alpha} \pi_m(x_1 \ldots x_t)/(\alpha \Delta)$$

$$= \pi_m(C_\alpha)/(\alpha \Delta) \ge \tau(C_\alpha)\Delta/(\alpha \Delta) = \tau(C_\alpha)/\alpha.$$

(Here both equalities and the first inequality are obvious; the
second and third inequalities follow from (17) and
(16), respectively.) So we obtain that $\tau(C_\alpha) \le \alpha$ for
any measure $\tau \in M_m(A)$. Taking into account that $C_\alpha$ is
the critical set of the test, we can see that the probability of
the Type I error is not greater than $\alpha$. The first claim of the
theorem is proven.

The proof of the second statement of the theorem will be
based on some results of Information Theory. The $t$-order
conditional Shannon entropy is defined as follows:

$$h_t(p) = -\sum_{x_1 \ldots x_t \in A^t} p(x_1 \ldots x_t) \sum_{a \in A} p(a \mid x_1 \ldots x_t) \log p(a \mid x_1 \ldots x_t), \tag{18}$$

where $p \in M_\infty(A)$. It is known that for any $p \in M_\infty(A)$,
firstly, $\log |A| \ge h_0(p) \ge h_1(p) \ge \ldots$; secondly, there exists
the following limit Shannon entropy $h_\infty(p) = \lim_{t \to \infty} h_t(p)$;
thirdly, $\lim_{t \to \infty} -t^{-1} \log p(x_1 \ldots x_t) = h_\infty(p)$ with
probability 1; and, finally, $h_m(p)$ is strictly greater than $h_\infty(p)$ if
the memory of $p$ is larger than $m$ (i.e. $p \in M_\infty(A) \setminus M_m(A)$);
see, for example, [3], [9].

Taking into account the definition of the universal predictor
(see (9)), we obtain from the above described properties of the
entropy that

$$\lim_{t \to \infty} -t^{-1} \log \sigma(x_1 \ldots x_t) = h_\infty(p) \tag{19}$$

with probability 1. It can be seen that $h^*_m$ (7) is a consistent
estimate of the $m$-order Shannon entropy (18), i.e.
$\lim_{t \to \infty} h^*_m(x_1 \ldots x_t) = h_m(p)$ with probability 1; see [3],
[9]. Taking into account that $h_m(p) > h_\infty(p)$ and
(19), we obtain from the last equality that
$\lim_{t \to \infty} \bigl((t - m)\, h^*_m(x_1 \ldots x_t) - \log(1/\sigma(x_1 \ldots x_t))\bigr) = \infty$. This proves the
second statement of the theorem.


